Best Tools for JAVA


Text File | 1995-09-18 | 2.6 KB | 67 lines

Cheap HTML parser

Jim Davis
davis@dri.cornell.edu
July 1994

This is code for doing simple processing on HTML. I know there are
bugs and limitations in the code, but it suffices for simple purposes.

Among the limitations: This is an HTML parser, not an SGML parser. It
does not accept a DTD; rather, the model of HTML is built into the
code. Also, it does not validate the HTML. It will attempt to parse
invalid documents, and the results are undefined if the document is in
error.

The source code is available as a compressed Unix tar file.

It runs under perl 4.0 patch level 36. I don't know about other
versions of perl.

This directory contains:

parse-html.pl
    A simple HTML parser written in perl. As it parses the HTML, it
    calls routines for each tag encountered, and for whitespace and
    content. You can redefine these routines so as to process the
    HTML document.

html-to-ascii.pl
    Uses the HTML parser to generate a plain ASCII version of an HTML
    document.

html-ascii.pl
    The actual routines to generate the ASCII.

tformat.pl
    A low-level text formatter used for generating ASCII. More or
    less like a subset of nroff.

html-to-rfc.pl
    Uses the HTML parser to generate a plain ASCII version of an HTML
    document, with the special formatting required for Internet
    drafts and RFCs.

rfc.pl
    Additional routines required for RFC formatting (e.g. page
    headers and footers).

Generating RFCs from HTML

The RFC format requires a header and footer containing, among other
things, the names of the authors, a short title, and so on. You
specify values for these fields with META tags, as shown by the
following example.

    <META name="status" content="Internet Draft">
    <META name="title" content="Internet audio protocol">
    <META name="date" content="July 1983">
    <META name="author" content="Nixon, Haldeman">

(The META tag is not officially part of HTML; it was proposed by Roy
Fielding.) The tags should be in the HEAD.

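To make the META convention concrete, here is a minimal sketch of how a callback-style parser might collect these fields from the HEAD. It is written in Python with the standard html.parser module, not with parse-html.pl, whose routine names are not documented here. The names MetaCollector and rfc_fields are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from META tags that appear inside HEAD.

    Illustration only: this is Python's standard parser, not parse-html.pl.
    """

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name is not None and content is not None:
                self.fields[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def rfc_fields(html_text):
    """Return a dict of the META fields found in the HEAD of html_text."""
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    return collector.fields


SAMPLE = """<HTML><HEAD>
<META name="status" content="Internet Draft">
<META name="title" content="Internet audio protocol">
<META name="date" content="July 1983">
<META name="author" content="Nixon, Haldeman">
</HEAD><BODY><P>Body text.</P></BODY></HTML>"""

if __name__ == "__main__":
    for name, value in rfc_fields(SAMPLE).items():
        print(f"{name}: {value}")
```

In this sketch, META tags outside the HEAD are ignored, which mirrors the advice that the tags should be in the HEAD.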
Known bugs

* It can't parse the prolog (or whatever you call it) because it does
  not know how to ensure that the square brackets match, e.g. the
  following:

    <!DOCTYPE HTML [
    <!entity % HTML.Minimal "INCLUDE">
    <!-- Include standard HTML DTD -->
    <!ENTITY % html PUBLIC "-//connolly hal.com//DTD WWW HTML 1.8//EN">
    %html;
    ]>

* Font tags (e.g. CODE, EM) cause extra whitespace in the output,
  e.g. <TT>foo</TT> yields "foo ,".
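The prolog bug comes down to matching square brackets. The following is a minimal sketch, in Python rather than perl, of one way a preprocessor could step over such a prolog before handing the rest of the document to a parser. It is not part of this package. The function name skip_doctype is hypothetical, and the sketch assumes quotes and comments inside the bracketed part are well formed.

```python
def skip_doctype(text, start=0):
    """Return the index just past a <!DOCTYPE ...> declaration at `start`.

    Returns `start` unchanged if no DOCTYPE begins there. Raises ValueError
    if the declaration never ends. Hypothetical helper, for illustration.
    """
    keyword = "<!DOCTYPE"
    if text[start:start + len(keyword)].upper() != keyword:
        return start

    depth = 0      # nesting depth of [ ... ]
    quote = None   # the quote character we are inside, if any
    i = start + len(keyword)
    while i < len(text):
        ch = text[i]
        if quote:
            # Inside a quoted literal: only the matching quote matters.
            if ch == quote:
                quote = None
        elif depth > 0 and text.startswith("<!--", i):
            # Skip a whole comment so its contents cannot confuse the scan.
            end = text.find("-->", i + 4)
            if end == -1:
                raise ValueError("unterminated comment in DOCTYPE")
            i = end + 3
            continue
        elif ch in "\"'":
            quote = ch
        elif ch == "[":
            depth += 1
        elif ch == "]":
            depth = max(depth - 1, 0)
        elif ch == ">" and depth == 0:
            # Only a '>' outside all brackets closes the declaration.
            return i + 1
        i += 1
    raise ValueError("unterminated DOCTYPE declaration")


PROLOG = '''<!DOCTYPE HTML [
<!entity % HTML.Minimal "INCLUDE">
<!-- Include standard HTML DTD -->
<!ENTITY % html PUBLIC "-//connolly hal.com//DTD WWW HTML 1.8//EN">
%html;
]>'''

if __name__ == "__main__":
    document = PROLOG + "\n<HTML></HTML>"
    print(document[skip_doctype(document):].strip())
```

The key design choice is that a '>' ends the declaration only when the bracket depth is zero, so the '>' characters that close the nested declarations are passed over.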